CONL708: Applied Machine Learning Project: Code¶

Louis Othen - S21002027

Import relevant Libraries¶

In [ ]:
import pandas                   as pd                           # Used for Data manipulation/ Analysis.
import pandas_profiling         as pf                           # EDA profiling tool for pandas DataFrames
import os                                                       # For system-related operations
import plotly.express           as px                           # For Data Visualisation

from sklearn.neighbors          import KNeighborsClassifier     # To implement K-Nearest Neighbors Algorithm
from sklearn.linear_model       import LogisticRegression       # To implement Logistic Regression Algorithm 
from sklearn.svm                import SVC                      # To implement Support Vector Machine Algorithm 

from sklearn.preprocessing      import StandardScaler           # Used to standardise values so features can be comparable/ less effect by outliers.
from sklearn.model_selection    import train_test_split         # Used to split datasets ready for modelling

from sklearn.metrics            import classification_report    # To show precision/recall/F1 per class
from sklearn.metrics            import confusion_matrix         # To show confusion matrix of model prediction results
from sklearn.metrics            import accuracy_score           # To show accuracy score of model(s)
from sklearn.metrics            import roc_auc_score, auc, roc_curve # To compute and plot ROC/AUC metrics
                                             

Now that all relevant libraries have been imported into the notebook, the next step is to load the Titanic datasets downloaded from Kaggle into pandas dataframes.

1- Load in and preview Titanic dataset(s)¶

In [ ]:
# Go to working directory
#-----------------------------------------------------------------
folder_path = r'C:\Users\lothe\OneDrive\Wrexham Uni (Masters)\CONL708 - Machine Learning\Summative Assignments\titanic'   # Raw string avoids backslash-escape issues on Windows paths
os.chdir(folder_path)
In [ ]:
# Load in titanic data
#------------------------------------------------------------------
titanic_data    = pd.read_csv('train.csv')
new_titanic     = pd.read_csv('test.csv')
In [ ]:
# Preview train dataset
#------------------------------------------------------------------
display(titanic_data.head())
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
In [ ]:
# Preview test dataset
#------------------------------------------------------------------
display(new_titanic.head())
PassengerId Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 892 3 Kelly, Mr. James male 34.5 0 0 330911 7.8292 NaN Q
1 893 3 Wilkes, Mrs. James (Ellen Needs) female 47.0 1 0 363272 7.0000 NaN S
2 894 2 Myles, Mr. Thomas Francis male 62.0 0 0 240276 9.6875 NaN Q
3 895 3 Wirz, Mr. Albert male 27.0 0 0 315154 8.6625 NaN S
4 896 3 Hirvonen, Mrs. Alexander (Helga E Lindqvist) female 22.0 1 1 3101298 12.2875 NaN S
In [ ]:
pf.ProfileReport(titanic_data)
Out[ ]:

2 - Split Training data¶

First, we need to separate the target variable (Survived) into its own variable (Y) for later use in the modelling phase, keeping it apart from the independent variables, which will be stored in a separate dataframe.

In [ ]:
titan_train = titanic_data.copy()                                    # Take a copy of titanic data, to preserve the original intact 
titan_train = titan_train[[
                        'PassengerId'
                        ,'Pclass'
                        ,'Name'
                        ,'Sex'
                        ,'Age'
                        ,'SibSp'
                        ,'Parch'
                        ,'Ticket'
                        ,'Fare'
                        ,'Cabin'
                        ,'Embarked'
                        ]]
                        
Y = titanic_data['Survived']                                       # To store training label                                                

3 - Pre-processing steps¶

Now that the training and test versions of the titanic dataset have been successfully loaded, the next stage is to perform transformations on the data so that it is ready for modelling. These steps include handling missing values, placing numerical values onto the same scale, removing unnecessary columns, and so forth.

3.1 - Removal of columns¶

From an initial glance, there appear to be some columns in both datasets that cannot be used as-is, particularly Name, Ticket, and PassengerId. An argument could be made that Name could carry indicators of survival, such as a title of Doctor or Reverend, but it may introduce bias at this stage, so it will still be removed.

In [ ]:
# Removal of columns
#---------------------------------------------------------------------
titan_train.drop(columns = ['PassengerId','Ticket','Name'], inplace = True)

3.2 - Conversion of Sex with one-hot encoding¶

The next attribute to deal with is Sex; this appears to be a potentially important feature for our modelling purposes, but cannot be used in its current format. Therefore, one-hot encoding can be employed to convert it into a numeric binary value. The new values will show as 1 (male) and 0 (female).

In [ ]:
# Conversion of Sex column with one-hot encoding
#---------------------------------------------------------------------
sex_dummy_tr                = pd.get_dummies(titan_train.Sex)               # One-hot encoding for training Data
titan_train['Gender']       = sex_dummy_tr.male                             # Add converted column back to training data            
titan_train.drop(columns    = 'Sex', inplace = True)                        # Remove older column from training data
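The effect of this encoding can be illustrated on a small toy series, separate from the notebook data:

```python
import pandas as pd

sex = pd.Series(['male', 'female', 'male'])
gender = pd.get_dummies(sex)['male'].astype(int)  # keep only the 'male' column as the binary flag

print(list(gender))  # → [1, 0, 1]
```

Keeping just one of the two dummy columns is sufficient here, since 'male' and 'female' are mutually exclusive.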

3.3 - Conversion of Cabin into a binary output¶

Now to focus on the Cabin attribute: the EDA report showed a mixture of passengers assigned or recorded to have a cabin, and those who were not. For this reason, it seems inadvisable to remove the data, as it could prove to be an important feature the models may consider as to who would survive the disaster. With that in mind, instead of removing any data here, the aim is to convert it to a binary value: having a cabin (1) or not (0). The first step is to convert any NULL values into a 0, and assign 1 to observations with a cabin number recorded.

In [ ]:
titan_train['Cabin'].fillna(0,inplace = True)                                   # For any Cabin values that are NULL/NaN, replace with 0         
titan_train['Cabin'] = titan_train['Cabin'].apply(lambda x: 1 if x != 0 else x) # If the Cabin value is not 0, then assign it as 1
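For reference, the same two steps can be collapsed into a single expression using notna(), which flags non-missing values directly; a sketch on toy data:

```python
import pandas as pd

cabin = pd.Series(['C85', None, 'C123', None])
has_cabin = cabin.notna().astype(int)  # 1 where a cabin was recorded, 0 where missing

print(list(has_cabin))  # → [1, 0, 1, 0]
```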

3.4 - Handling the missing values within Age¶

As can be seen from the data, there are values missing from the Age column. Therefore, rather than removing the rows with missing values, data imputation can be performed, filling the gaps with the mean of the recorded ages. This ensures as much of the available data as possible can be used, without losing information that could be modelled upon.

In [ ]:
titan_train['Age'] = titan_train['Age'].apply(lambda x: titan_train['Age'].mean().round(0) if pd.isnull(x) else x) 
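Pandas' fillna offers a more direct equivalent of the lambda above; a sketch on toy data:

```python
import pandas as pd

ages = pd.Series([22.0, 38.0, None, 35.0])
filled = ages.fillna(ages.mean().round(0))  # mean of 22, 38, 35 ≈ 31.67, rounded to 32.0

print(list(filled))  # → [22.0, 38.0, 32.0, 35.0]
```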

3.5 - SibSp and Parch Columns¶

Two attributes in the dataset appear similar in nature: SibSp represents the number of siblings or spouses aboard with the passenger, whilst Parch describes the number of parents or children aboard. These two attributes can therefore be combined into one feature, family size, to reduce cardinality slightly. One is added to the sum so that a passenger travelling alone has a family size of 1.

In [ ]:
titan_train['Family_size'] = titan_train['SibSp'] + titan_train['Parch'] + 1        # Sums the two attributes together, plus one to represent a passenger traveling alone.
titan_train.drop(columns=['SibSp','Parch'],inplace = True)                          # Remove SibSp and Parch columns

3.6 - One-hot encoding the Embarked Column¶

The penultimate pre-processing step concerns the Embarked column, which records the port where the passenger boarded the Titanic: Southampton in the UK (S), Cherbourg in Normandy (C), or Queenstown, now known as Cobh, in Ireland (Q). In its current form, most models cannot accept these categorical values as they are, but through one-hot encoding once again, they can be converted.

In [ ]:
embarked_dummies            = pd.get_dummies(titan_train.Embarked)          # Perform One-hot encoding on Embarked Column 
titan_train['Emb_Southampton']  = embarked_dummies['S']                     # Column for passengers who embarked from Southampton
titan_train['Emb_Cherbourg']    = embarked_dummies['C']                     # Column for passengers who embarked from Cherbourg
titan_train['Emb_Queenstown']   = embarked_dummies['Q']                     # Column for passengers who embarked from Queenstown

titan_train.drop(columns='Embarked', inplace = True)

3.7 - Scaling¶

The last step in pre-processing involves scaling all input values to standardise them.

In [ ]:
sc = StandardScaler()                                                       # Prepares for scaling to commence
X = sc.fit_transform(titan_train)                                           # Scales the training data.
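What StandardScaler does can be seen on a single toy column: each value is shifted by the column mean and divided by the column standard deviation, leaving the column with mean 0 and unit variance:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

toy = np.array([[1.0], [2.0], [3.0]])
scaled = StandardScaler().fit_transform(toy)  # (x - mean) / std, computed per column

print(scaled.ravel())  # → [-1.22474487  0.          1.22474487]
```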

4 - Prepare Train test split for the titanic train dataset¶

Now that all the preprocessing is complete, the data from the train.csv file can be split into training and testing datasets to run the three ML models against. Note that the data from test.csv is used later to perform predictions on data the model(s) have not previously seen. The train.csv data will be split with 67% used for training the models and 33% held out to test model performance.

In [ ]:
X_train, X_test, Y_train, Y_test = train_test_split(X,Y,test_size= 0.33,random_state = 27)
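One refinement worth noting: train_test_split accepts a stratify argument, which preserves the class ratio of the target in both splits; on an imbalanced target such as Survived this can make test metrics more stable. A sketch on toy data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_demo = np.arange(20).reshape(10, 2)        # 10 toy samples, 2 features
y_demo = np.array([0]*6 + [1]*4)             # imbalanced 60/40 target

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=27, stratify=y_demo)

# with stratify, the 3-sample test fold keeps roughly the 60/40 class balance
print(sorted(y_te))  # → [0, 0, 1]
```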

5.1 - Modelling - K-Nearest Neighbors (K-NN)¶

For the K-NN model, a for loop shall iterate over values 1-20 of k, to see which number is the most optimal.

In [ ]:
knn_results     = pd.DataFrame()
accuracy        = []
confusion_mat   = []
k               = []
class_rpt       = []


# Loop through k between 1-20 finding best k to based on accuracy
#------------------------------------------------------------
for i in range(1,21):
    knn = KNeighborsClassifier(n_neighbors= i)
    
    knn.fit(X_train,Y_train)
    knn_y_pred = knn.predict(X_test)
    k.append(i)
    accuracy.append(accuracy_score(Y_test,knn_y_pred))
    confusion_mat.append(confusion_matrix(Y_test,knn_y_pred))
    class_rpt.append(classification_report(Y_test,knn_y_pred,output_dict=True))

# Place acquired metrics into summary table
# ----------------------------------------------------------   
knn_results['k'] = k
knn_results['accuracy'] = accuracy
knn_results['confusion_matrix'] = confusion_mat

# Sort dataframe to show highest accuracy score on top 
#-----------------------------------------------------------
knn_results = knn_results.sort_values(by = ['accuracy'],ascending = False)

Now that the above code has executed, the results have been stored in a dataframe showing the k used, along with the accuracy score and confusion matrix for each.

In [ ]:
knn_results
Out[ ]:
k accuracy confusion_matrix
5 6 0.844068 [[173, 14], [32, 76]]
9 10 0.837288 [[170, 17], [31, 77]]
15 16 0.837288 [[170, 17], [31, 77]]
19 20 0.833898 [[168, 19], [30, 78]]
6 7 0.833898 [[166, 21], [28, 80]]
7 8 0.833898 [[170, 17], [32, 76]]
13 14 0.833898 [[171, 16], [33, 75]]
17 18 0.830508 [[167, 20], [30, 78]]
16 17 0.830508 [[166, 21], [29, 79]]
8 9 0.830508 [[165, 22], [28, 80]]
12 13 0.827119 [[164, 23], [28, 80]]
18 19 0.827119 [[165, 22], [29, 79]]
10 11 0.827119 [[163, 24], [27, 81]]
3 4 0.827119 [[171, 16], [35, 73]]
11 12 0.823729 [[168, 19], [33, 75]]
14 15 0.823729 [[164, 23], [29, 79]]
4 5 0.823729 [[163, 24], [28, 80]]
2 3 0.820339 [[161, 26], [27, 81]]
0 1 0.776271 [[155, 32], [34, 74]]
1 2 0.769492 [[174, 13], [55, 53]]

Based on the iteration above, it appears that k = 6 is the optimal number of neighbours. In light of this, this configuration can now be applied to build the final model.

In [ ]:
KNN = KNeighborsClassifier(n_neighbors= 6)
KNN.fit(X_train,Y_train)
KNN_y_pred = KNN.predict(X_test)
print(confusion_matrix(Y_test,KNN_y_pred))
print(accuracy_score(Y_test,KNN_y_pred))

print(classification_report(Y_test,KNN_y_pred))
[[173  14]
 [ 32  76]]
0.8440677966101695
              precision    recall  f1-score   support

           0       0.84      0.93      0.88       187
           1       0.84      0.70      0.77       108

    accuracy                           0.84       295
   macro avg       0.84      0.81      0.83       295
weighted avg       0.84      0.84      0.84       295
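The accuracy figure above can be verified by hand from the confusion matrix: accuracy is the sum of the diagonal (correct predictions) over the total number of cases. A minimal check:

```python
import numpy as np

cm = np.array([[173, 14],           # confusion matrix from the K-NN output above
               [32, 76]])           # rows: actual class, columns: predicted class

accuracy = np.trace(cm) / cm.sum()  # correct predictions (173 + 76) over all 295 cases
print(accuracy)                     # → 0.8440677966101695
```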

5.2 - Modelling - Logistic Regression¶

Now to build a model and show the performance on the training data using Logistic Regression.

In [ ]:
regr = LogisticRegression(solver='liblinear', random_state=1)
regr.fit(X_train,Y_train)
log_y_pred = regr.predict(X_test)

print(confusion_matrix(Y_test,log_y_pred))

print("Accuracy for training data : ",accuracy_score(Y_test,log_y_pred))
print("Auc score : " , roc_auc_score(Y_test,regr.predict_proba(X_test)[:, 1]))
print(classification_report(Y_test,log_y_pred))
[[160  27]
 [ 33  75]]
Accuracy for training data :  0.7966101694915254
Auc score :  0.852619330560507
              precision    recall  f1-score   support

           0       0.83      0.86      0.84       187
           1       0.74      0.69      0.71       108

    accuracy                           0.80       295
   macro avg       0.78      0.78      0.78       295
weighted avg       0.79      0.80      0.80       295

Based on the predictions of the logistic regression model, the accuracy score comes in at approximately 79.66%, with an AUC score of approximately 85.26%.

In [ ]:
fpr, tpr, thresholds = roc_curve(Y_test, regr.predict_proba(X_test)[:, 1])  # Use predicted probabilities, not hard class labels, for a meaningful ROC curve
fig = px.area(
    x=fpr, y=tpr,
    title=f'ROC Curve (AUC={auc(fpr, tpr):.4f})',
    labels=dict(x='False Positive Rate', y='True Positive Rate'),
    width=700, height=500
)
fig.add_shape(
    type='line', line=dict(dash='dash'),
    x0=0, x1=1, y0=0, y1=1
)

fig.update_yaxes(scaleanchor="x", scaleratio=1)
fig.update_xaxes(constrain='domain')
fig.show()

5.3 - Modelling - Support Vector Machines (SVM)¶

Finally, a model is created on the data via the use of SVM.

In [ ]:
clf_svm = SVC(gamma='auto')
clf_svm.fit(X_train,Y_train)
svm_y_pred = clf_svm.predict(X_test)

print(confusion_matrix(Y_test,svm_y_pred))
print("Accuracy for training data : ",accuracy_score(Y_test,svm_y_pred))
print(classification_report(Y_test,svm_y_pred))
[[166  21]
 [ 28  80]]
Accuracy for training data :  0.8338983050847457
              precision    recall  f1-score   support

           0       0.86      0.89      0.87       187
           1       0.79      0.74      0.77       108

    accuracy                           0.83       295
   macro avg       0.82      0.81      0.82       295
weighted avg       0.83      0.83      0.83       295

In [ ]:
fpr, tpr, thresholds = roc_curve(Y_test, clf_svm.decision_function(X_test))  # SVC has no predict_proba by default, so use decision scores rather than hard class labels
fig = px.area(
    x=fpr, y=tpr,
    title=f'ROC Curve (AUC={auc(fpr, tpr):.4f})',
    labels=dict(x='False Positive Rate', y='True Positive Rate'),
    width=700, height=500
)
fig.add_shape(
    type='line', line=dict(dash='dash'),
    x0=0, x1=1, y0=0, y1=1
)

fig.update_yaxes(scaleanchor="x", scaleratio=1)
fig.update_xaxes(constrain='domain')
fig.show()

6 - Apply model(s) to new data¶

Now that the models have been created against the training data from the titanic dataset, they need to be applied to the test data, as if the model is seeing new data. However, before commencing, the pre-processing steps need to be repeated against the new data (placed into a function this time for ease).

In [ ]:
def preprocess(df):
    df = df.copy()                                                                                          # Work on a copy so the caller's dataframe is left intact
    df.drop(columns         = ['PassengerId','Ticket','Name'], inplace = True)                              # Removal of unneeded columns
    sex_dummy_te            = pd.get_dummies(df.Sex)                                                        # One-hot encoding on Sex column
    df['Gender']            = sex_dummy_te.male                                                             # Add converted Sex column back
    df.drop(columns         = 'Sex', inplace = True)                                                        # Remove Sex column
    df['Cabin'].fillna(0,inplace = True)                                                                    # For any Cabin values that are NULL/NaN, replace with 0
    df['Cabin']             = df['Cabin'].apply(lambda x: 1 if x != 0 else x)                               # If the Cabin value is not 0, then assign it as 1
    df['Age']               = df['Age'].apply(lambda x: df['Age'].mean().round(0) if pd.isnull(x) else x)   # Impute the mean Age for any missing values
    df['Fare'].fillna(df['Fare'].mean(), inplace = True)                                                    # test.csv contains a missing Fare value; impute the mean so scaling does not fail on NaN
    df['Family_size']       = df['SibSp'] + df['Parch'] + 1                                                 # Sum the two attributes, plus one to represent the passenger themself
    df.drop(columns=['SibSp','Parch'],inplace = True)                                                       # Remove SibSp and Parch columns
    embarked_dummies        = pd.get_dummies(df.Embarked)                                                   # Perform one-hot encoding on Embarked column
    df['Emb_Southampton']   = embarked_dummies['S']                                                         # Column for passengers who embarked from Southampton
    df['Emb_Cherbourg']     = embarked_dummies['C']                                                         # Column for passengers who embarked from Cherbourg
    df['Emb_Queenstown']    = embarked_dummies['Q']                                                         # Column for passengers who embarked from Queenstown
    df.drop(columns='Embarked', inplace = True)                                                             # Drop Embarked column
    sc = StandardScaler()                                                                                   # Note: this refits the scaler on the new data, rather than reusing the one fitted on the training data
    df = sc.fit_transform(df)                                                                               # Scales the data
    return df
In [ ]:
new_data = preprocess(new_titanic)
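One caveat with the function above: it fits a fresh StandardScaler on the new data, so the new features end up on a slightly different scale from the one the models were trained on. A sketch of the usual alternative, using toy stand-in arrays, is to fit on the training data only and reuse those parameters:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train_demo = np.array([[10.0], [20.0], [30.0]])  # stand-in for the training features
new_demo   = np.array([[25.0]])                  # stand-in for the unseen data

sc = StandardScaler().fit(train_demo)            # learn mean/std from training data only
scaled_new = sc.transform(new_demo)              # apply the same parameters to new data

print(scaled_new[0, 0])                          # (25 - 20) / 8.165 ≈ 0.612
```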